Focusing on the recipients

We can first focus on the features that we extracted as informative in the recipients, to see if, taken together, they would be more informative than when working on one data type alone.

Tolerant versus non tolerant recipients

Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We first focus on the features that seemed to play a role when trying to disinguish non tolerant from tolerant recipients.

We first need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##          Id.Cryostem.R            GROUP GENDER        DOB        DOG
## D1_TOL           D2497     non_tolerant      F 1962-07-06       <NA>
## R1_2_TOL         R2498     non_tolerant      F 1954-12-05 2014-08-06
## D2_NoTOL          D284     non_tolerant      F 1967-04-03       <NA>
## R2_NoTOL          R385     non_tolerant      M 1953-12-16 2013-05-02
## D3_TOL           D1502 primary_tolerant      F 1990-04-03       <NA>
## R3_1_TOL         R1503 primary_tolerant      M 1991-05-19 2014-01-02
##          gender_comp age_recip cmv_comp
## D1_TOL            FF     21794       01
## R1_2_TOL          FF     21794       01
## D2_NoTOL          FM     21687      NA1
## R2_NoTOL          FM     21687      NA1
## D3_TOL            FM      8264       11
## R3_1_TOL          FM      8264       11

I add a prefix (“cyt_”, “rna_” or “met_”) in front of the features names, and I then merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 23 patients in the St Louis cohort."

Volcano plot analysis

We can do a volcano plot on the CyTOF features metabolites and genes of the non versus tolerant recipients:

Correlation analysis

#pdf("~/Desktop/correlationplot_integration_recip_TNT.pdf", width = 16, height = 16)
gr <- plot_correlations(mat4volcano, compar = "TNT")
#dev.off()

Random forest modeling

To see if the features that we selected can help us to clssify the patients, we apply a 5 fold crosse validation setting on the 23 patients for which we have CyTOF, metabolomics and transcriptomics information.

set.seed(1)
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- mat4volcano[-cv_id[[idx]], c("group", colnames(ft_integ)[-1])]
  test_set <- mat4volcano[cv_id[[idx]], c(colnames(ft_integ)[-1])]
  true_labels <- mat4volcano[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information that the others

We can investigate how much of the principal components are driven by the different data types.

The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 278 genes, against 42 metabolites and 24 CyTOF features only.

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC8 and PC10) also look correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, cGVHD and ABO.incompatibility:

The second PC seems to be slightly correlated with the disorder group:

The third PC seems to be slightly correlated with aGVHD and ABO.incompatibility:

The fourth PC seems to be most correlated with group and age_recip:

Since the 1st and 4th PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the tolerant recipients (on the bottom-left) from the non tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD.

We can also map the cryostem patients on thiese PCs:

other <- readRDS("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/integ_national/pca_2plot.RDS")
other4pca <- other %>% dplyr::select(colnames(mat4pca))
rownames(other4pca) <- other$Id.Cryostem.R

new_coords <- scale(other4pca, pca$center, pca$scale) %*% pca$rotation 

other_2plot <- other %>%
  dplyr::select(Id.Cryostem.R, group, age_recip, cGVHD) %>% 
  mutate(PC1 = new_coords[,1],
         PC4 = new_coords[,4])

all_2plot <- rbind(pca_2plot %>% 
                     dplyr::select(Id.Cryostem.R, group, age_recip, cGVHD, PC1, PC4) %>% 
                     mutate(cohort = rep("St_Louis", nrow(pca_2plot))),
                   other_2plot %>% 
                     mutate(cohort = rep("Cryostem", nrow(other_2plot))))

p1 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= group, label = Id.Cryostem.R)) + 
  geom_point(aes(shape = cohort, size = cohort)) +
  geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
  ggtitle("Separation of tolerant vs non tolerant recipients")
# p2 <- ggplot(pca_2plot, aes(x= PC1, y= PC4, col= ABO.incompatibility, label = Id.Cryostem.R)) + 
#   scale_color_manual(values = c("chartreuse2", "black", "gray48")) +
#   geom_point() +
#   geom_text(aes(label=Id.Cryostem.R),hjust=0, vjust=0) +
#   ggtitle("Separation of recipients based on ABO.incompatibility")
p3 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= age_recip, label = Id.Cryostem.R)) +
  geom_point(aes(shape = cohort, size = cohort)) +
  geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
  ggtitle("Gradient in the recipients based on age")
p4 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= as.factor(cGVHD), label = Id.Cryostem.R)) +
  scale_color_manual(values = c("black", "red")) +
  geom_point(aes(shape = cohort, size = cohort)) +
  geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
  ggtitle("Separation of recipients based on cGVHD")
{p1 +p3 + p4}
## Warning: Using size for a discrete variable is not advised.

## Warning: Using size for a discrete variable is not advised.

## Warning: Using size for a discrete variable is not advised.

Features expressed in the first prinipal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                           feature     overexp orig
## 1                                 ENSG00000228956 -0.08156929  rna
## 2                                 ENSG00000242861 -0.08037703  rna
## 3                                          CNKSR2 -0.08024199  rna
## 4                                          ABLIM1 -0.07877023  rna
## 5                                           SPON1 -0.07850535  rna
## 6                                 ENSG00000222078 -0.07788463  rna
## 7                                            HSF5 -0.07771093  rna
## 8                                           BACH2 -0.07761380  rna
## 9                                           AXIN2 -0.07732816  rna
## 10                                      LINC01215 -0.07698807  rna
## 11                                       C12orf42 -0.07669759  rna
## 12                                ENSG00000280123 -0.07658899  rna
## 13                                       IFNG.AS1 -0.07635503  rna
## 14                                          KLHL3 -0.07623209  rna
## 15                                         DCBLD2 -0.07554311  rna
## 16                                     LACTB2.AS1 -0.07500851  rna
## 17                                      LINC01259 -0.07499767  rna
## 18                                          NELL2 -0.07492519  rna
## 19                                       KLF3.AS1 -0.07490429  rna
## 20                                ENSG00000261685 -0.07409850  rna
## 21                           androsterone.sulfate -0.07407656  met
## 22                                         RNU6.8 -0.07387305  rna
## 23                                        GPRASP1 -0.07363176  rna
## 24                                          BEND5 -0.07350428  rna
## 25                                       SBF2.AS1 -0.07322580  rna
## 26        dehydroisoandrosterone.sulfate..DHEA.S. -0.07316363  met
## 27                                        RETREG1 -0.07308135  rna
## 28                                           RIC3 -0.07245566  rna
## 29                                         LARGE1 -0.07216967  rna
## 30                                            MCC -0.07181769  rna
## 31                                         ZNF256 -0.07165713  rna
## 32                                   LOC100507053 -0.07163273  rna
## 33                                          STRBP -0.07159202  rna
## 34                                          FBLN5 -0.07143257  rna
## 35                                ENSG00000218730 -0.07099955  rna
## 36  androstenediol..3beta.17beta..monosulfate..1. -0.07079649  met
## 37                                          PARM1 -0.07057739  rna
## 38                                           FCMR -0.07053659  rna
## 39                                          KCNH8 -0.07047437  rna
## 40                                          NPAS2 -0.07042575  rna
## 41                                ENSG00000276097 -0.07030503  rna
## 42                                          BCL7A -0.07019059  rna
## 43                        epiandrosterone.sulfate -0.06996150  met
## 44                                       GOLGA6L9 -0.06992679  rna
## 45                                   LOC105369827 -0.06980929  rna
## 46                                         PAIP2B -0.06944937  rna
## 47                                        ALDH5A1 -0.06913474  rna
## 48                                           SDK2 -0.06900392  rna
## 49                                ENSG00000224610 -0.06894275  rna
## 50                                         SPAG16 -0.06859450  rna
## 51                                ENSG00000211611 -0.06842521  rna
## 52                                           NT5E -0.06836499  rna
## 53                                       C19orf18 -0.06830055  rna
## 54                                          TCTN1 -0.06823324  rna
## 55                                          ZNF90 -0.06812791  rna
## 56                                          EPHA4 -0.06783398  rna
## 57                                         RASSF6 -0.06783038  rna
## 58                                ENSG00000270659 -0.06782274  rna
## 59                                          ACSM3 -0.06746280  rna
## 60                                         COBLL1 -0.06735532  rna
## 61                                          USP44 -0.06733209  rna
## 62                                ENSG00000218713 -0.06726899  rna
## 63                                           CLNK -0.06726424  rna
## 64                                       UBE2Q2P2 -0.06721565  rna
## 65                                          ABCB4 -0.06663164  rna
## 66     X1.stearoyl.2.arachidonoyl.GPC..18.0.20.4. -0.06638245  met
## 67                                         ZNF711 -0.06629134  rna
## 68                                        CNTNAP2 -0.06593364  rna
## 69                                            OCM -0.06558826  rna
## 70                                           TTC9 -0.06556499  rna
## 71                                ENSG00000214797 -0.06547693  rna
## 72                                        PPFIBP1 -0.06519230  rna
## 73                                        B3GALT2 -0.06471202  rna
## 74                                         P2RY10 -0.06465634  rna
## 75                                ENSG00000211679 -0.06459460  rna
## 76                                        MIR3679 -0.06459080  rna
## 77                                          SYNPO -0.06457483  rna
## 78                                          PLPP5 -0.06454993  rna
## 79                                         ZNF165 -0.06412746  rna
## 80                                         ZNF540 -0.06400003  rna
## 81                                          IL23R -0.06387896  rna
## 82                                           CHD7 -0.06376742  rna
## 83                                         AMIGO1 -0.06353382  rna
## 84                                          LAMP3 -0.06348932  rna
## 85          X23_DN..86.....CD8low..13...TSCM.like -0.06319538  cyt
## 86                                ENSG00000225569 -0.06299126  rna
## 87                                          NRCAM -0.06280542  rna
## 88                                          OVGP1 -0.06278661  rna
## 89                                           CDNF -0.06259539  rna
## 90                                ENSG00000235251 -0.06252194  rna
## 91                                ENSG00000234902 -0.06234214  rna
## 92                                ENSG00000211820 -0.06206949  rna
## 93                                           DAB1 -0.06203348  rna
## 94                                ENSG00000254777 -0.06168796  rna
## 95                                ENSG00000273062 -0.06150881  rna
## 96                                         MARCH3 -0.06145497  rna
## 97                                ENSG00000144642 -0.06072733  rna
## 98                                          TTC28 -0.06070103  rna
## 99                                        ZC3H12B -0.06061452  rna
## 100                                       SDR42E1 -0.06050211  rna
## 101                                         PWAR5 -0.06018475  rna

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                                      feature
## 1                                                                     SRGAP1
## 2                                                                      P2RY1
## 3                                                                       C1QC
## 4                                                                     FLVCR2
## 5                                                                       IL10
## 6                                                                    ADAMTS2
## 7                                                                     PCYT1B
## 8                                                                       TCN2
## 9                                                                     TRIM58
## 10                                                                      MAFB
## 11                                                                      MMP8
## 12                                                                     MEIS1
## 13                                                                      C1QB
## 14                                                                      C1QA
## 15                                                                     ACSL1
## 16                                                                      TNS1
## 17                                                                     LTBP1
## 18                                                                     SPARC
## 19                                                                    SH3TC2
## 20                                                                   ARHGAP6
## 21                                                                      SELP
## 22                                                                     ITGB3
## 23                                                                      LCN2
## 24                                                                    ABLIM3
## 25                    X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 26                                                                      MAOB
## 27                                                                    CTDSPL
## 28                                                                     TTC7B
## 29                                                                    LANCL3
## 30                                                                    MFSD2B
## 31                                                                  SH3PXD2B
## 32                                                                     PEAR1
## 33                            X.CD38._6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS.
## 34                                                                      SDC3
## 35                                                       X.CD38._27_CD4.Treg
## 36                                             X.CTLA4._24_CD4.TCM.Th17.like
## 37           X.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34..
## 38                                                                       CA1
## 39                                                                      FBP1
## 40                                                                   ANKRD22
## 41                                                                    TREML1
## 42 X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers
## 43              X.CD24._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3..
## 44                                                                  SERPING1
##       overexp orig
## 1  0.07303734  rna
## 2  0.07161886  rna
## 3  0.07095143  rna
## 4  0.07060925  rna
## 5  0.07044322  rna
## 6  0.06943897  rna
## 7  0.06662012  rna
## 8  0.06660225  rna
## 9  0.06563776  rna
## 10 0.06543088  rna
## 11 0.06515072  rna
## 12 0.06409927  rna
## 13 0.06391818  rna
## 14 0.06356339  rna
## 15 0.06329629  rna
## 16 0.06310627  rna
## 17 0.06225234  rna
## 18 0.06124110  rna
## 19 0.06045283  rna
## 20 0.05993583  rna
## 21 0.05951066  rna
## 22 0.05942689  rna
## 23 0.05921813  rna
## 24 0.05900356  rna
## 25 0.05891123  cyt
## 26 0.05873980  rna
## 27 0.05810987  rna
## 28 0.05768060  rna
## 29 0.05682371  rna
## 30 0.05677874  rna
## 31 0.05457941  rna
## 32 0.05424961  rna
## 33 0.05412829  cyt
## 34 0.05408341  rna
## 35 0.05370935  cyt
## 36 0.05360698  cyt
## 37 0.05326823  cyt
## 38 0.05300907  rna
## 39 0.05230039  rna
## 40 0.05119924  rna
## 41 0.05093662  rna
## 42 0.05071422  cyt
## 43 0.05028871  cyt
## 44 0.05012938  rna
Features expressed in the fourth prinipal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                                                     feature
## 1                                                                   leucine
## 2                                                              beta.alanine
## 3                                                                   alanine
## 4                                                                isoleucine
## 5                                                                    valine
## 6                                                                 histidine
## 7                                                                asparagine
## 8  X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## 9                                                   X3.methyl.2.oxovalerate
## 10                                                               methionine
## 11                                                        choline.phosphate
## 12                                                                   lysine
## 13                                           cysteine.glutathione.disulfide
## 14                                                X4.methyl.2.oxopentanoate
## 15                                            X1.arachidonoyl.GPE..20.4n6..
## 16                                                               tryptophan
## 17                                                                 tyrosine
##       overexp orig
## 1  -0.1546621  met
## 2  -0.1490560  met
## 3  -0.1373994  met
## 4  -0.1275363  met
## 5  -0.1272318  met
## 6  -0.1254381  met
## 7  -0.1249464  met
## 8  -0.1241974  cyt
## 9  -0.1233013  met
## 10 -0.1227031  met
## 11 -0.1218951  met
## 12 -0.1182365  met
## 13 -0.1174700  met
## 14 -0.1153245  met
## 15 -0.1106526  met
## 16 -0.1084344  met
## 17 -0.1019113  met

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the TOP of the PC4). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                                     feature
## 1                                                                  PDCD1LG2
## 2                                                                    GBP1P1
## 3                                                                      GBP5
## 4                                                                 LINC02528
## 5                                          methyl.4.hydroxybenzoate.sulfate
## 6                              X.CD38._23_DN..86.....CD8low..13...TSCM.like
## 7                                                X.CD38._33_B.naives..100..
## 8                                                                    P2RY10
## 9 X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers
##     overexp orig
## 1 0.1438614  rna
## 2 0.1289141  rna
## 3 0.1167383  rna
## 4 0.1125767  rna
## 5 0.1090165  met
## 6 0.1061148  cyt
## 7 0.1052441  cyt
## 8 0.1004525  rna
## 9 0.1004288  cyt

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 23 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.82      0.6571429
##   0.25  L1    2       0.82      0.6571429
##   0.25  L1    3       0.82      0.6571429
##   0.25  L2    1       0.82      0.6571429
##   0.25  L2    2       0.87      0.7571429
##   0.25  L2    3       0.83      0.6662338
##   0.50  L1    1       0.82      0.6571429
##   0.50  L1    2       0.82      0.6571429
##   0.50  L1    3       0.82      0.6571429
##   0.50  L2    1       0.82      0.6571429
##   0.50  L2    2       0.87      0.7571429
##   0.50  L2    3       0.83      0.6662338
##   1.00  L1    1       0.82      0.6571429
##   1.00  L1    2       0.82      0.6571429
##   1.00  L1    3       0.82      0.6571429
##   1.00  L2    1       0.82      0.6571429
##   1.00  L2    2       0.87      0.7571429
##   1.00  L2    3       0.83      0.6662338
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L2 and weight = 2.

Compare the features found in the PC1 and PC4 to the MOFA features:

In MOFA, the factor one is most tolerance related, it consists mainly of CyTOF and Metabo features. In this MOFA factor, the recipients are nocely split into non tolerant (low loadings) to tolerant (high loadings on the first factor).

cyto_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_cyto_4F_weights.csv")
metabo_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_meta_4F_weights.csv")
rna_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_RNA_4F_weights.csv")
MOFA_res <- rbind(cyto_MOFA, metabo_MOFA, rna_MOFA) %>% 
  mutate(orig = c(rep("cyt", nrow(cyto_MOFA)),
                  rep("met", nrow(metabo_MOFA)),
                  rep("rna", nrow(rna_MOFA))))

PC1 loadings: (low loadings: tolerant ++, high loadings: non tolerant)

# pc1_tol <- read.xlsx("../data/integ/tol_ft_PC1.xlsx")
overexp <- as.data.frame(pca$rotation[,c(1,4)]) %>% 
  rownames_to_column("X") %>% 
  magrittr::set_colnames(c("X", "PC1", "PC4"))
orig_split <- strsplit(overexp$X, "_")
orig <- map(orig_split, function(x){x[1]}) %>% unlist

overexp <- overexp %>% 
  mutate(orig = orig,
         X = gsub("^[a-z]+_", "", X))

We can then compare the results of the PCA (focusing on the PCs 1 and 4, which were most associated with tolerance), to the 1st factor in MOFA (“LF1”): The concordance between the cyto and metabo features is quite good. The MOFA LF1 didn’t contain RNA features, which explains the uggly horizontal trend for this datatype.

both <- overexp %>% 
  left_join(MOFA_res, by = c("X", "orig")) %>% 
  mutate(PC1 = -PC1,
         PC4 = -PC4)
## Warning: Column `X` joining character vector and factor, coercing into character
## vector
ggplot(both, aes(x = PC1, y = LF1, col = orig)) +
  geom_point() +
  scale_fill_manual(values = c("palegreen3", "mediumorchid4", "sandybrown")) +
  ggtitle("Concordance between MOFA and PCA results")
## Warning: Removed 11 rows containing missing values (geom_point).

ggplot(both, aes(x = PC4, y = LF1, col = orig)) +
  geom_point() +
  scale_fill_manual(values = c("palegreen3", "mediumorchid4", "sandybrown")) +
  ggtitle("Concordance between MOFA and PCA results")
## Warning: Removed 11 rows containing missing values (geom_point).

I can make a heatmap of the most important genes along the PC1:

pc1_rna_table <- overexp %>% 
  filter(orig == "rna") %>% 
  mutate(PC1 = abs(PC1)) %>% 
  arrange(desc(PC1))
pc1_rna_features <- pc1_rna_table$X[1:20]

rna_table <- ft_integ[,paste0("rna_", "", pc1_rna_features)]
rownames(rna_table) <- ft_integ$Id.Cryostem.R
colnames(rna_table) <- gsub("rna_", "", colnames(rna_table))

info_table <- info_complete_patients %>% 
  mutate(cGVHD = as.factor(cGVHD)) %>% 
  column_to_rownames("Id.Cryostem.R") 
info_table <- info_table[rownames(rna_table),]

# arrange patients per PC1:
pca_table <- pca_2plot %>% 
  select(Id.Cryostem.R, PC1, PC4) %>% 
  arrange(PC1) 

info_table <- info_table[pca_table$Id.Cryostem.R,]
rna_table <- rna_table[pca_table$Id.Cryostem.R,]

pheatmap::pheatmap(t(rna_table), cluster_cols = FALSE,
                   annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")], 
                   annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
                                            ABO.incompatibility = c(Major = "red", 
                                                                    Minor = "orange", 
                                                                    Compatible = "blue")))

I can make a heatmap of the most important metabolites along the PC1:

pc1_met_table <- overexp %>% 
  filter(orig == "met") %>% 
  mutate(PC1 = abs(PC1)) %>% 
  arrange(desc(PC1))
pc1_met_features <- pc1_met_table$X[1:20]

met_table <- ft_integ[,paste0("met_", "", pc1_met_features)]
rownames(met_table) <- ft_integ$Id.Cryostem.R
colnames(met_table) <- gsub("met_", "", colnames(met_table))

info_table <- info_complete_patients %>% 
  mutate(cGVHD = as.factor(cGVHD)) %>% 
  column_to_rownames("Id.Cryostem.R") 
info_table <- info_table[rownames(met_table),]

# arrange patients per PC1:
pca_table <- pca_2plot %>% 
  select(Id.Cryostem.R, PC1, PC4) %>% 
  arrange(PC1) 

info_table <- info_table[pca_table$Id.Cryostem.R,]
met_table <- met_table[pca_table$Id.Cryostem.R,]

pheatmap::pheatmap(t(met_table), cluster_cols = FALSE,
                   annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")], 
                   annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
                                            ABO.incompatibility = c(Major = "red", 
                                                                    Minor = "orange", 
                                                                    Compatible = "blue")))

I can make a heatmap of the most important CyTOF features along the PC1:

pc1_cyt_table <- overexp %>% 
  filter(orig == "cyt") %>% 
  mutate(PC1 = abs(PC1)) %>% 
  arrange(desc(PC1))
pc1_cyt_features <- pc1_cyt_table$X[1:20]

cyt_table <- ft_integ[,paste0("cyt_", "", pc1_cyt_features)]
rownames(cyt_table) <- ft_integ$Id.Cryostem.R
colnames(cyt_table) <- gsub("cyt_", "", colnames(cyt_table))

info_table <- info_complete_patients %>% 
  mutate(cGVHD = as.factor(cGVHD)) %>% 
  column_to_rownames("Id.Cryostem.R") 
info_table <- info_table[rownames(cyt_table),]

# arrange patients per PC1:
pca_table <- pca_2plot %>% 
  select(Id.Cryostem.R, PC1, PC4) %>% 
  arrange(PC1) 

info_table <- info_table[pca_table$Id.Cryostem.R,]
cyt_table <- cyt_table[pca_table$Id.Cryostem.R,]

pheatmap::pheatmap(t(cyt_table), cluster_cols = FALSE,
                   annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")], 
                   annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
                                            ABO.incompatibility = c(Major = "red", 
                                                                    Minor = "orange", 
                                                                    Compatible = "blue")))

I can make a heatmap of the most important genes along the PC4:

pc4_rna_table <- overexp %>% 
  filter(orig == "rna") %>% 
  mutate(PC4 = abs(PC4)) %>% 
  arrange(desc(PC4))
pc4_rna_features <- pc4_rna_table$X[1:20]

rna_table <- ft_integ[,paste0("rna_", "", pc4_rna_features)]
rownames(rna_table) <- ft_integ$Id.Cryostem.R
colnames(rna_table) <- gsub("rna_", "", colnames(rna_table))

info_table <- info_complete_patients %>% 
  mutate(cGVHD = as.factor(cGVHD)) %>% 
  column_to_rownames("Id.Cryostem.R") 
info_table <- info_table[rownames(rna_table),]

# arrange patients per PC1:
pca_table <- pca_2plot %>% 
  select(Id.Cryostem.R, PC1, PC4) %>% 
  arrange(PC4) 

info_table <- info_table[pca_table$Id.Cryostem.R,]
rna_table <- rna_table[pca_table$Id.Cryostem.R,]

pheatmap::pheatmap(t(rna_table), cluster_cols = FALSE,
                   annotation_col = info_table[,c("group", "age_recip")], 
                   annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue")))

Separate data type analyses

For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients as well without integrating the data:

CyTOF:

We perform PCA using only the 24 features that were selected as informative in both cohorts:

The first comoponent seems to be capturing most of the data variability:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC6 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can also have a look at the features that are driving these two components:

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.79      0.6142857
##   0.25  L1    2       0.79      0.6142857
##   0.25  L1    3       0.79      0.6142857
##   0.25  L2    1       0.79      0.6142857
##   0.25  L2    2       0.75      0.5233766
##   0.25  L2    3       0.71      0.4476190
##   0.50  L1    1       0.75      0.5233766
##   0.50  L1    2       0.75      0.5233766
##   0.50  L1    3       0.75      0.5233766
##   0.50  L2    1       0.75      0.5233766
##   0.50  L2    2       0.75      0.5233766
##   0.50  L2    3       0.75      0.5233766
##   1.00  L1    1       0.75      0.5233766
##   1.00  L1    2       0.75      0.5233766
##   1.00  L1    3       0.75      0.5233766
##   1.00  L2    1       0.71      0.4476190
##   1.00  L2    2       0.71      0.4476190
##   1.00  L2    3       0.75      0.5233766
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

The accuracy is quite lower than in the integrative model.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(cyt4pca),]
cyt4rf <- cbind(info_rf$group, cyt4pca)
colnames(cyt4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- cyt4rf[-cv_id[[idx]], ]
  test_set <- cyt4rf[cv_id[[idx]], -1]
  true_labels <- cyt4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88
Metabolomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.83      0.6424242
##   0.25  L1    2       0.83      0.6424242
##   0.25  L1    3       0.83      0.6424242
##   0.25  L2    1       0.83      0.6424242
##   0.25  L2    2       0.79      0.5655012
##   0.25  L2    3       0.79      0.5655012
##   0.50  L1    1       0.83      0.6424242
##   0.50  L1    2       0.83      0.6424242
##   0.50  L1    3       0.83      0.6424242
##   0.50  L2    1       0.79      0.5655012
##   0.50  L2    2       0.79      0.5655012
##   0.50  L2    3       0.79      0.5655012
##   1.00  L1    1       0.79      0.5655012
##   1.00  L1    2       0.79      0.5655012
##   1.00  L1    3       0.79      0.5655012
##   1.00  L2    1       0.74      0.4655012
##   1.00  L2    2       0.74      0.4655012
##   1.00  L2    3       0.74      0.4655012
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

Again, the accuracy of the model is slightly lower than the accuracy of the integrative model.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the metabolites of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(met4pca),]
met4rf <- cbind(info_rf$group, met4pca)
colnames(met4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- met4rf[-cv_id[[idx]], ]
  test_set <- met4rf[cv_id[[idx]], -1]
  true_labels <- met4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.8266667
Transcriptomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC8 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa     
##   0.25  L1    1       0.53      0.03846154
##   0.25  L1    2       0.53      0.03846154
##   0.25  L1    3       0.53      0.03846154
##   0.25  L2    1       0.53      0.03846154
##   0.25  L2    2       0.53      0.03846154
##   0.25  L2    3       0.53      0.03846154
##   0.50  L1    1       0.53      0.03846154
##   0.50  L1    2       0.53      0.03846154
##   0.50  L1    3       0.53      0.03846154
##   0.50  L2    1       0.53      0.03846154
##   0.50  L2    2       0.53      0.03846154
##   0.50  L2    3       0.53      0.03846154
##   1.00  L1    1       0.53      0.03846154
##   1.00  L1    2       0.53      0.03846154
##   1.00  L1    3       0.53      0.03846154
##   1.00  L2    1       0.53      0.03846154
##   1.00  L2    2       0.53      0.03846154
##   1.00  L2    3       0.53      0.03846154
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

It looks like, in our integrative approach, we managed to separate the recipients based on their tolerance better than when we look at the different data types separately.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the genes of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(rna4pca),]
rna4rf <- cbind(info_rf$group, rna4pca)
colnames(rna4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- rna4rf[-cv_id[[idx]], ]
  test_set <- rna4rf[cv_id[[idx]], -1]
  true_labels <- rna4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.84

Primary versus secondary tolerant recipients

We now focus on the features that seemed to play a role when trying to disinguish primary from secondary tolerant recipients.

We first need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##          Id.Cryostem.R            GROUP GENDER        DOB        DOG
## D1_TOL           D2497     non_tolerant      F 1962-07-06       <NA>
## R1_2_TOL         R2498     non_tolerant      F 1954-12-05 2014-08-06
## D2_NoTOL          D284     non_tolerant      F 1967-04-03       <NA>
## R2_NoTOL          R385     non_tolerant      M 1953-12-16 2013-05-02
## D3_TOL           D1502 primary_tolerant      F 1990-04-03       <NA>
## R3_1_TOL         R1503 primary_tolerant      M 1991-05-19 2014-01-02
##          gender_comp age_recip cmv_comp
## D1_TOL            FF     21794       01
## R1_2_TOL          FF     21794       01
## D2_NoTOL          FM     21687      NA1
## R2_NoTOL          FM     21687      NA1
## D3_TOL            FM      8264       11
## R3_1_TOL          FM      8264       11

I add a prefix (“cyt_”, “rna_” or “met_”) in front of the features names, and I then merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 10 patients in the St Louis cohort."

Volcano plot analysis

We can do a volcano plot on the CyTOF features metabolites and genes of the non versus tolerant recipients:

Correlation analysis

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information that the others

We can investigate how much of the principal components are driven by the different data types.

The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 278 genes, against 42 metabolites and 24 CyTOF features only.

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC2 also looks correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, SAL and blood incompatibility:

The second PC seems to be most correlated with the source of graft, cGVHD and group:

The third PC seems to be correlated with cmv:

The fourth PC seems to be slightly correlated with the recipient’s age:

Since the 1st and 2nd PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the secondary tolerant recipients (on the bottom-left) from the primary tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to recipients age and aGVHD.

Features expressed in the first principal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in primary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.2 threshold) :

##                feature     overexp orig
## 1               CDC25A -0.25699203  rna
## 2                MZT2B -0.25058094  rna
## 3                PCLAF -0.24943522  rna
## 4                 E2F8 -0.24781996  rna
## 5                TOP2A -0.24297857  rna
## 6                UBE2C -0.24196127  rna
## 7                MKI67 -0.24151215  rna
## 8              SNORD17 -0.23577399  rna
## 9                 ASB2 -0.22842881  rna
## 10              FAM83D -0.22491879  rna
## 11     ENSG00000233597 -0.22387887  rna
## 12                MAPT -0.21923703  rna
## 13     ENSG00000210107 -0.21302546  rna
## 14     ENSG00000280981 -0.21246004  rna
## 15             CCDC157 -0.20362641  rna
## 16               KLRC2 -0.18556896  rna
## 17               ASAP3 -0.17480115  rna
## 18               AGMAT -0.17203260  rna
## 19        LOC101929574 -0.16686789  rna
## 20              TMEM53 -0.13343402  rna
## 21 X3.aminoisobutyrate -0.08116272  met
## 22          GPRC5D.AS1 -0.05982952  rna

We then focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in secondary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.05 threshold) :

##                              feature    overexp orig
## 1 X.CD38._16_T.population.Naive.TSCM 0.12087507  cyt
## 2                    ENSG00000256658 0.08235933  rna
## 3                             PRUNE2 0.07847496  rna
## 4           X2.palmitoyl.GPC..16.0.. 0.06876445  met
Features expressed in the third prinipal component

We first focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed at the BOTTOM of the PC2). Therefore, the lower the feature value, the more it is expressed in secondary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC2 (under the -0.3 threshold) :

##                              feature    overexp orig
## 1                 X.CD38._29_DP.Treg -0.4574753  cyt
## 2 X.CD38._16_T.population.Naive.TSCM -0.3850877  cyt
## 3                              HACD1 -0.3342307  rna

We then focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed at the TOP of the PC2). Therefore, the higher the feature value, the more it is expressed in primary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC2 (over the 0.2 threshold) :

##               feature   overexp orig
## 1 X3.aminoisobutyrate 0.4032602  met
## 2               KLRC2 0.3071328  rna

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 10 samples
## 10 predictors
##  2 classes: 'primary_tolerant', 'secondary_tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 8, 8, 8, 8, 8 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa
##   0.25  L1    1       1.0       1.0  
##   0.25  L1    2       1.0       1.0  
##   0.25  L1    3       1.0       1.0  
##   0.25  L2    1       1.0       1.0  
##   0.25  L2    2       1.0       1.0  
##   0.25  L2    3       1.0       1.0  
##   0.50  L1    1       1.0       1.0  
##   0.50  L1    2       1.0       1.0  
##   0.50  L1    3       1.0       1.0  
##   0.50  L2    1       1.0       1.0  
##   0.50  L2    2       1.0       1.0  
##   0.50  L2    3       1.0       1.0  
##   1.00  L1    1       1.0       1.0  
##   1.00  L1    2       1.0       1.0  
##   1.00  L1    3       1.0       1.0  
##   1.00  L2    1       0.9       0.8  
##   1.00  L2    2       1.0       1.0  
##   1.00  L2    3       1.0       1.0  
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

Separate data type analyses

For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients as well without integrating the data:

CyTOF:

We apply PCA only on the 24 CyTOF features:

Variability captured by each component:

Since we have only two CyTOF features, the separation is not that good:

The secondary tolerant recipients are badly separated from the primary tolerant recipients compared to the integrative PCA.

Metabolomics:

Variability captured by each component:

The non tolerant recipients are slightly better separated from the tolerant recipients than in the integrative PCA.

Transcriptomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The fourth PC is very correlated with the tolerance. The PC1 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly better separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 10 samples
##  9 predictor
##  2 classes: 'primary_tolerant', 'secondary_tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 8, 8, 8, 8, 8 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa
##   0.25  L1    1       0.9       0.8  
##   0.25  L1    2       0.9       0.8  
##   0.25  L1    3       0.9       0.8  
##   0.25  L2    1       0.9       0.8  
##   0.25  L2    2       0.9       0.8  
##   0.25  L2    3       0.9       0.8  
##   0.50  L1    1       0.9       0.8  
##   0.50  L1    2       0.9       0.8  
##   0.50  L1    3       0.9       0.8  
##   0.50  L2    1       0.9       0.8  
##   0.50  L2    2       0.9       0.8  
##   0.50  L2    3       0.9       0.8  
##   1.00  L1    1       0.9       0.8  
##   1.00  L1    2       0.9       0.8  
##   1.00  L1    3       0.9       0.8  
##   1.00  L2    1       0.9       0.8  
##   1.00  L2    2       0.9       0.8  
##   1.00  L2    3       0.9       0.8  
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

Adding information on the couples to the recipients

We can then add the features that we extracted as informative in the couples, to see if, taken together, they would be more informative than when working on the integrated data of recipients alone.

Tolerant versus non tolerant recipients

Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We first focus on the features that seemed to play a role when trying to disinguish non tolerant from tolerant recipients and couples.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##          Id.Cryostem.R            GROUP GENDER        DOB        DOG
## D1_TOL           D2497     non_tolerant      F 1962-07-06       <NA>
## R1_2_TOL         R2498     non_tolerant      F 1954-12-05 2014-08-06
## D2_NoTOL          D284     non_tolerant      F 1967-04-03       <NA>
## R2_NoTOL          R385     non_tolerant      M 1953-12-16 2013-05-02
## D3_TOL           D1502 primary_tolerant      F 1990-04-03       <NA>
## R3_1_TOL         R1503 primary_tolerant      M 1991-05-19 2014-01-02
##          gender_comp age_recip cmv_comp
## D1_TOL            FF     21794       01
## R1_2_TOL          FF     21794       01
## D2_NoTOL          FM     21687      NA1
## R2_NoTOL          FM     21687      NA1
## D3_TOL            FM      8264       11
## R3_1_TOL          FM      8264       11

We then need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 23 patients in the St Louis cohort."

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information that the others

We can investigate how much of the principal components are driven by the different data types.

The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 365 genes, against 45 metabolites and 49 CyTOF features only.

We can also see how the features coming from recipients and couples are distributed in the PCs:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC9 and PC10) also look correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, cGVHD and ABO.incompatibility:

The second PC seems to be slightly correlated with the disorder group:

The third PC seems to be slightly correlated with GENDER:

The fourth PC seems to be most correlated with group and age_recip:

Since the 1st and 4th PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the tolerant recipients (on the bottom-left) from the non tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD.

Features expressed in the first principal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                            feature     overexp orig
## 1                                R_ENSG00000228956 -0.07214794  rna
## 2                                R_ENSG00000242861 -0.07137799  rna
## 3                                         R_ABLIM1 -0.07091034  rna
## 4                                         R_CNKSR2 -0.07070764  rna
## 5                                          R_AXIN2 -0.07020108  rna
## 6                                          R_SPON1 -0.06978692  rna
## 7                                       R_IFNG.AS1 -0.06964967  rna
## 8                                      R_LINC01215 -0.06919056  rna
## 9                                          R_BACH2 -0.06905632  rna
## 10                                         R_KLHL3 -0.06891575  rna
## 11                               R_ENSG00000280123 -0.06888325  rna
## 12                               R_ENSG00000222078 -0.06877962  rna
## 13                                          R_HSF5 -0.06874967  rna
## 14                                    R_LACTB2.AS1 -0.06847038  rna
## 15                                      R_C12orf42 -0.06806523  rna
## 16                                         R_NELL2 -0.06766421  rna
## 17                                        R_DCBLD2 -0.06726008  rna
## 18                                     R_LINC01259 -0.06725230  rna
## 19                               R_ENSG00000261685 -0.06630861  rna
## 20                                      R_KLF3.AS1 -0.06628893  rna
## 21                                        R_RNU6.8 -0.06594652  rna
## 22                          R_androsterone.sulfate -0.06581100  met
## 23                                       R_GPRASP1 -0.06560190  rna
## 24                                           R_MCC -0.06538130  rna
## 25                                      R_SBF2.AS1 -0.06516144  rna
## 26       R_dehydroisoandrosterone.sulfate..DHEA.S. -0.06497831  met
## 27                                         R_BEND5 -0.06469171  rna
## 28                                      RD_ADAMTS2 -0.06468782  rna
## 29                                        R_ZNF256 -0.06462537  rna
## 30                                          R_RIC3 -0.06452537  rna
## 31                                         R_NPAS2 -0.06420972  rna
## 32                                       R_RETREG1 -0.06389604  rna
## 33                                        R_LARGE1 -0.06374134  rna
## 34                                         R_FBLN5 -0.06372004  rna
## 35                                         R_STRBP -0.06329040  rna
## 36                                         R_PARM1 -0.06322360  rna
## 37                               R_ENSG00000276097 -0.06316167  rna
## 38 R_androstenediol..3beta.17beta..monosulfate..1. -0.06312344  met
## 39                                         R_EPHA4 -0.06276244  rna
## 40                                         R_BCL7A -0.06272235  rna
## 41                               R_ENSG00000218730 -0.06260937  rna
## 42                                  R_LOC105369827 -0.06260816  rna
## 43                       R_epiandrosterone.sulfate -0.06235016  met
## 44                                  R_LOC100507053 -0.06224064  rna
## 45                                         R_KCNH8 -0.06223713  rna
## 46                                          R_FCMR -0.06223590  rna
## 47                                      R_GOLGA6L9 -0.06216064  rna
## 48                                          R_CLNK -0.06144979  rna
## 49                                          R_SDK2 -0.06135589  rna
## 50                                        R_RASSF6 -0.06131564  rna
## 51                               R_ENSG00000270659 -0.06127537  rna
## 52                                        R_SPAG16 -0.06124884  rna
## 53                               R_ENSG00000211611 -0.06117975  rna
## 54                                       R_ALDH5A1 -0.06083118  rna
## 55                                        R_PAIP2B -0.06057827  rna
## 56                                          R_NT5E -0.06021522  rna
## 57                                         R_TCTN1 -0.06008607  rna
## 58                                        R_COBLL1 -0.06005623  rna

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                     feature    overexp orig
## 1                                                 RD_RNU6.8 0.06755602  rna
## 2                                                  R_SRGAP1 0.06694151  rna
## 3                                                    R_C1QC 0.06432033  rna
## 4                                                  R_FLVCR2 0.06368485  rna
## 5                                                    R_IL10 0.06341900  rna
## 6                                                   R_P2RY1 0.06336965  rna
## 7                                             RD_LACTB2.AS1 0.06282644  rna
## 8                                                 R_ADAMTS2 0.06232133  rna
## 9                                               RD_C12orf42 0.06129320  rna
## 10                                                   R_TCN2 0.06079819  rna
## 11                                              RD_GOLGA6L9 0.05996115  rna
## 12                                                 RD_NPHP1 0.05980620  rna
## 13                                                   R_MMP8 0.05902071  rna
## 14                                                 R_PCYT1B 0.05873280  rna
## 15                                       RD_ENSG00000236946 0.05863950  rna
## 16                                                   R_MAFB 0.05843769  rna
## 17                                                RD_RASSF6 0.05840565  rna
## 18                                                 R_TRIM58 0.05836639  rna
## 19                                                   R_C1QB 0.05816563  rna
## 20                                       RD_ENSG00000256708 0.05812740  rna
## 21                                                 RD_PARM1 0.05802210  rna
## 22                                                  R_ACSL1 0.05761205  rna
## 23                                          RD_LOC105369827 0.05750758  rna
## 24                                                   R_C1QA 0.05721502  rna
## 25                                              RD_IFNG.AS1 0.05683979  rna
## 26                                                   R_TNS1 0.05665430  rna
## 27                                                  R_MEIS1 0.05605488  rna
## 28                                       RD_ENSG00000226254 0.05520352  rna
## 29                                                  R_LTBP1 0.05481638  rna
## 30                                             RD_LINC01781 0.05472895  rna
## 31                                               RD_SLCO5A1 0.05400467  rna
## 32                                                RD_SPINK2 0.05379921  rna
## 33                                                  R_SPARC 0.05350108  rna
## 34                                                 R_SH3TC2 0.05314816  rna
## 35                                                 RD_PLPP5 0.05311318  rna
## 36                                                   R_LCN2 0.05307289  rna
## 37 R_X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70... 0.05286087  cyt
## 38                                             RD_LINC00540 0.05282655  rna
## 39                                                RD_FCER1A 0.05278899  rna
## 40                                                   R_MAOB 0.05262794  rna
## 41                                                R_ARHGAP6 0.05241173  rna
## 42                                                 R_ABLIM3 0.05238824  rna
## 43                                       RD_ENSG00000224093 0.05229034  rna
## 44                                                RD_COL4A3 0.05218472  rna
## 45                                                  R_ITGB3 0.05201325  rna
## 46                                                   R_SELP 0.05167402  rna
## 47                                               R_SH3PXD2B 0.05152729  rna
## 48                                                 R_CTDSPL 0.05141839  rna
## 49                                              RD_C19orf18 0.05065875  rna
## 50                                            RD_ANKRD36BP2 0.05060710  rna
##    type
## 1    RD
## 2     R
## 3     R
## 4     R
## 5     R
## 6     R
## 7    RD
## 8     R
## 9    RD
## 10    R
## 11   RD
## 12   RD
## 13    R
## 14    R
## 15   RD
## 16    R
## 17   RD
## 18    R
## 19    R
## 20   RD
## 21   RD
## 22    R
## 23   RD
## 24    R
## 25   RD
## 26    R
## 27    R
## 28   RD
## 29    R
## 30   RD
## 31   RD
## 32   RD
## 33    R
## 34    R
## 35   RD
## 36    R
## 37    R
## 38   RD
## 39   RD
## 40    R
## 41    R
## 42    R
## 43   RD
## 44   RD
## 45    R
## 46    R
## 47    R
## 48    R
## 49   RD
## 50   RD
Features expressed in the fourth prinipal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                                             feature    overexp
## 1                    R_X.CD38._23_DN..86.....CD8low..13...TSCM.like -0.1112053
## 2 R_X.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34.. -0.1097242
## 3                                             R_X.CD38._27_CD4.Treg -0.1080819
## 4                              R_X.CD38._35_Conventional.DCs...90.. -0.1062696
##   orig type
## 1  cyt    R
## 2  cyt    R
## 3  cyt    R
## 4  cyt    R

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                                       feature
## 1                                                                   R_leucine
## 2                                                                R_methionine
## 3  R_X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## 4                                                              R_beta.alanine
## 5                                                                   R_alanine
## 6                                                                R_asparagine
## 7                                                                R_isoleucine
## 8                                                                    R_valine
## 9                                                   R_X3.methyl.2.oxovalerate
## 10                                                                R_histidine
##      overexp orig type
## 1  0.1317358  met    R
## 2  0.1285054  met    R
## 3  0.1168507  cyt    R
## 4  0.1130490  met    R
## 5  0.1116219  met    R
## 6  0.1106410  met    R
## 7  0.1097835  met    R
## 8  0.1078810  met    R
## 9  0.1007362  met    R
## 10 0.1000724  met    R

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 23 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.74      0.5032967
##   0.25  L1    2       0.74      0.5032967
##   0.25  L1    3       0.74      0.5032967
##   0.25  L2    1       0.79      0.5923077
##   0.25  L2    2       0.79      0.5923077
##   0.25  L2    3       0.83      0.6692308
##   0.50  L1    1       0.74      0.5032967
##   0.50  L1    2       0.74      0.5032967
##   0.50  L1    3       0.74      0.5032967
##   0.50  L2    1       0.74      0.4923077
##   0.50  L2    2       0.79      0.5923077
##   0.50  L2    3       0.83      0.6692308
##   1.00  L1    1       0.74      0.5032967
##   1.00  L1    2       0.74      0.5032967
##   1.00  L1    3       0.74      0.5032967
##   1.00  L2    1       0.74      0.4923077
##   1.00  L2    2       0.79      0.5923077
##   1.00  L2    3       0.83      0.6692308
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L2 and weight = 3.

The results haven’t been improved by the addition of the couple’s features.

Random forest modeling

To see if the features that we selected can help us to clssify the patients, we apply a 5 fold cross validation setting on the 23 patients for which we have CyTOF, metabolomics and transcriptomics information.

set.seed(1)
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- pca_2plot[-cv_id[[idx]], c("group", colnames(ft_integ)[-1])]
  test_set <- pca_2plot[cv_id[[idx]], c(colnames(ft_integ)[-1])]
  true_labels <- pca_2plot[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.92

Separate data type analyses

For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients and couples as well without integrating the data:

CyTOF:

We perform PCA using only the 49 features that were selected as informative in the recipients and couples in both cohorts:

The first comoponent seems to be capturing most of the data variability:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC9 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly better separated from the tolerant recipients than in the integrative PCA.

We can also have a look at the features that are driving these two components:

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.79      0.6142857
##   0.25  L1    2       0.84      0.7142857
##   0.25  L1    3       0.84      0.7142857
##   0.25  L2    1       0.75      0.5263736
##   0.25  L2    2       0.79      0.6142857
##   0.25  L2    3       0.75      0.5263736
##   0.50  L1    1       0.71      0.4494505
##   0.50  L1    2       0.75      0.5263736
##   0.50  L1    3       0.75      0.5263736
##   0.50  L2    1       0.79      0.5923077
##   0.50  L2    2       0.79      0.5923077
##   0.50  L2    3       0.75      0.5263736
##   1.00  L1    1       0.79      0.5904762
##   1.00  L1    2       0.79      0.5904762
##   1.00  L1    3       0.79      0.5904762
##   1.00  L2    1       0.79      0.5904762
##   1.00  L2    2       0.83      0.6564103
##   1.00  L2    3       0.79      0.5923077
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 2.

The accuracy is quite lower than in the integrative model.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(cyt4pca),]
cyt4rf <- cbind(info_rf$group, cyt4pca)
colnames(cyt4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- cyt4rf[-cv_id[[idx]], ]
  test_set <- cyt4rf[cv_id[[idx]], -1]
  true_labels <- cyt4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.84
Metabolomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.83      0.6424242
##   0.25  L1    2       0.83      0.6424242
##   0.25  L1    3       0.83      0.6424242
##   0.25  L2    1       0.79      0.5783217
##   0.25  L2    2       0.83      0.6424242
##   0.25  L2    3       0.83      0.6424242
##   0.50  L1    1       0.83      0.6424242
##   0.50  L1    2       0.83      0.6424242
##   0.50  L1    3       0.83      0.6424242
##   0.50  L2    1       0.83      0.6424242
##   0.50  L2    2       0.79      0.5783217
##   0.50  L2    3       0.83      0.6424242
##   1.00  L1    1       0.83      0.6424242
##   1.00  L1    2       0.83      0.6424242
##   1.00  L1    3       0.83      0.6424242
##   1.00  L2    1       0.83      0.6424242
##   1.00  L2    2       0.83      0.6424242
##   1.00  L2    3       0.83      0.6424242
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

Again, the accuracy of the model is slightly lower than the accuracy of the integrative model.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(met4pca),]
met4rf <- cbind(info_rf$group, met4pca)
colnames(met4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- met4rf[-cv_id[[idx]], ]
  test_set <- met4rf[cv_id[[idx]], -1]
  true_labels <- met4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.8266667
Transcriptomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC8 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa     
##   0.25  L1    1       0.48      -0.1034965
##   0.25  L1    2       0.48      -0.1034965
##   0.25  L1    3       0.48      -0.1034965
##   0.25  L2    1       0.44      -0.1641026
##   0.25  L2    2       0.44      -0.1641026
##   0.25  L2    3       0.44      -0.1641026
##   0.50  L1    1       0.44      -0.1641026
##   0.50  L1    2       0.44      -0.1641026
##   0.50  L1    3       0.44      -0.1641026
##   0.50  L2    1       0.44      -0.1641026
##   0.50  L2    2       0.44      -0.1641026
##   0.50  L2    3       0.44      -0.1641026
##   1.00  L1    1       0.44      -0.1641026
##   1.00  L1    2       0.44      -0.1641026
##   1.00  L1    3       0.44      -0.1641026
##   1.00  L2    1       0.44      -0.1641026
##   1.00  L2    2       0.44      -0.1641026
##   1.00  L2    3       0.44      -0.1641026
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

It looks like, in our integrative approach, we managed to separate the recipients based on their tolerance better than when we look at the different data types separately.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(rna4pca),]
rna4rf <- cbind(info_rf$group, rna4pca)
colnames(rna4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- rna4rf[-cv_id[[idx]], ]
  test_set <- rna4rf[cv_id[[idx]], -1]
  true_labels <- rna4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.84

Looking at differences between primary and secondary recipients + couples

Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We now focus on the features that seemed to play a role when trying to disinguish primary tolerant from secondary tolerant recipients and couples.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##                  GROUP GENDER COUPLENUMBER        DOG TIMEPPOINT gender_comp
## D2497     non_tolerant      F            1       <NA>       D-15          FF
## R2498     non_tolerant      F            1 2014-08-06         Y2          FF
## D284      non_tolerant      F            2       <NA>       D-15          FM
## R385      non_tolerant      M            2 2013-05-02         Y2          FM
## D1502 primary_tolerant      F            3       <NA>       D-15          FM
## R1503 primary_tolerant      M            3 2014-01-02         Y2          FM
##       age_recip cmv_comp
## D2497     21794       01
## R2498     21794       01
## D284      21687      NA1
## R385      21687      NA1
## D1502      8264       11
## R1503      8264       11

We then need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 10 patients in the St Louis cohort."

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information that the others

We can investigate how much of the principal components are driven by the different data types.

The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 365 genes, against 45 metabolites and 49 CyTOF features only.

We can also see how the features coming from recipients and couples are distributed in the PCs:

## Warning: Removed 10 rows containing missing values (position_stack).

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC9 and PC10) also look correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, and SAL:

The second PC seems to be correlated with the hematologic disorder group:

The third PC seems to be slightly correlated with cmv compatibility:

The fourth PC seems to be most correlated with age_recip:

Since the 1st PC seems to be most correlated with tolerance, we visualise the patients in the two dimensions that carry most variability: the PCs 1 and 2. We observe that these 2 components seem to separate the tolerant recipients (on the bottom-left) from the non tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD.

Features expressed in the first principal component

Since most information related to tolerance is carried in the PC1, we look into it: We first focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in primary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file

##                                                  feature      overexp orig
## 1                                               R_CDC25A -0.209143449  rna
## 2                                                R_TOP2A -0.194526288  rna
## 3                                                R_PCLAF -0.193864230  rna
## 4                                                 R_E2F8 -0.190771808  rna
## 5                                                R_MZT2B -0.188426183  rna
## 6                                                R_MKI67 -0.187969410  rna
## 7  RD_subst_X1.palmitoleoyl.2.linoleoyl.GPC..16.1.18.2.. -0.184393000  met
## 8                                               R_FAM83D -0.184121574  rna
## 9                                                R_UBE2C -0.180719661  rna
## 10                                             R_SNORD17 -0.179272604  rna
## 11                                                R_MAPT -0.174917576  rna
## 12                                                R_ASB2 -0.171767208  rna
## 13                                     R_ENSG00000233597 -0.166052134  rna
## 14                                     R_ENSG00000210107 -0.160281709  rna
## 15                                               R_KLRC2 -0.158573027  rna
## 16                                     R_ENSG00000280981 -0.152324328  rna
## 17                                             R_CCDC157 -0.148317283  rna
## 18                                               R_AGMAT -0.144357349  rna
## 19                                        R_LOC101929574 -0.140194063  rna
## 20                                               R_ASAP3 -0.132761128  rna
## 21                                               RD_FMN1 -0.126501337  rna
## 22                                            RD_RPS6KA6 -0.120511630  rna
## 23                     RD_subst_X2.palmitoyl.GPC..16.0.. -0.107207029  met
## 24                                              R_TMEM53 -0.097197500  rna
## 25                      RD_subst_sphingosine.1.phosphate -0.091362475  met
## 26                                 R_X3.aminoisobutyrate -0.088735676  met
## 27 RD_subst_X1.oleoyl.2.docosahexaenoyl.GPC..18.1.22.6.. -0.080939552  met
## 28                                  RD_subst_sphingosine -0.073248413  met
## 29               RD_X.ICOS..9_MoDCs..99.....CD4.TEM..1.. -0.072511683  cyt
## 30                                          R_GPRC5D.AS1 -0.044119036  rna
## 31                                          RD_TAF1A.AS1 -0.009814738  rna

We then focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in secondary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                 feature     overexp orig type
## 1                              RD_TOP2A 0.187593732  rna   RD
## 2                               RD_EME1 0.183332940  rna   RD
## 3                               RD_ANLN 0.182962995  rna   RD
## 4                                RD_TTK 0.175694401  rna   RD
## 5                              RD_PCLAF 0.174547675  rna   RD
## 6                             RD_CDC25A 0.171601060  rna   RD
## 7                    RD_ENSG00000219747 0.164835507  rna   RD
## 8                    RD_ENSG00000228232 0.142205465  rna   RD
## 9                               RD_QPRT 0.134946340  rna   RD
## 10                   RD_ENSG00000279347 0.121060179  rna   RD
## 11                              RD_DDTL 0.110530376  rna   RD
## 12 R_X.CD38._16_T.population.Naive.TSCM 0.091335956  cyt    R
## 13           R_X2.palmitoyl.GPC..16.0.. 0.089620809  met    R
## 14                             R_PRUNE2 0.076552080  rna    R
## 15                    R_ENSG00000256658 0.059149380  rna    R
## 16                 R_X.CD38._29_DP.Treg 0.023498272  cyt    R
## 17                              R_HACD1 0.016009319  rna    R
## 18                   RD_ENSG00000240568 0.009657358  rna   RD

Adding information on the donors to the recipients

We can then add the features that we extracted as informative in the donors, to see if, taken together, they would be more informative than when working on the integrated data of recipients alone.

Tolerant versus non tolerant recipients + donors

Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We first focus on the features that seemed to play a role when trying to disinguish non tolerant from tolerant recipients and donors.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##          Id.Cryostem.R            GROUP GENDER        DOB        DOG
## D1_TOL           D2497     non_tolerant      F 1962-07-06       <NA>
## R1_2_TOL         R2498     non_tolerant      F 1954-12-05 2014-08-06
## D2_NoTOL          D284     non_tolerant      F 1967-04-03       <NA>
## R2_NoTOL          R385     non_tolerant      M 1953-12-16 2013-05-02
## D3_TOL           D1502 primary_tolerant      F 1990-04-03       <NA>
## R3_1_TOL         R1503 primary_tolerant      M 1991-05-19 2014-01-02
##          gender_comp age_recip cmv_comp
## D1_TOL            FF     21794       01
## R1_2_TOL          FF     21794       01
## D2_NoTOL          FM     21687      NA1
## R2_NoTOL          FM     21687      NA1
## D3_TOL            FM      8264       11
## R3_1_TOL          FM      8264       11

We then need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 23 patients in the St Louis cohort."

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information than the others

We can investigate how much of the principal components are driven by the different data types.

We can see the data origin (CyTOF, metabolomics or transcriptomics) of the variables in each PC.

We can also see how the features coming from recipients and donors are distributed in the PCs:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC8 and PC10) also look correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, cGVHD and ABO.incompatibility:

The second PC seems to be slightly correlated with the disorder group:

The third PC seems to be slightly correlated with aGVHD:

The fourth PC seems to be most correlated with group and age_recip:

Since the 1st and 4th PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the tolerant recipients (on the bottom-left) from the non tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD.

Features expressed in the first principal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                            feature     overexp orig type
## 1                                R_ENSG00000228956 -0.08075653  rna    R
## 2                                R_ENSG00000242861 -0.07951634  rna    R
## 3                                         R_CNKSR2 -0.07867695  rna    R
## 4                                         R_ABLIM1 -0.07775230  rna    R
## 5                                          R_SPON1 -0.07686619  rna    R
## 6                                R_ENSG00000222078 -0.07683521  rna    R
## 7                                          R_BACH2 -0.07638486  rna    R
## 8                                           R_HSF5 -0.07604186  rna    R
## 9                                          R_AXIN2 -0.07602336  rna    R
## 10                               R_ENSG00000280123 -0.07556144  rna    R
## 11                                         R_KLHL3 -0.07533858  rna    R
## 12                                      R_IFNG.AS1 -0.07519686  rna    R
## 13                                     R_LINC01215 -0.07516668  rna    R
## 14                                      R_C12orf42 -0.07509667  rna    R
## 15                                     R_LINC01259 -0.07426642  rna    R
## 16                                         R_NELL2 -0.07405004  rna    R
## 17                                        R_DCBLD2 -0.07389167  rna    R
## 18                                      R_KLF3.AS1 -0.07383617  rna    R
## 19                          R_androsterone.sulfate -0.07353145  met    R
## 20                                    R_LACTB2.AS1 -0.07348986  rna    R
## 21                               R_ENSG00000261685 -0.07309564  rna    R
## 22                                        R_RNU6.8 -0.07301855  rna    R
## 23                                       R_GPRASP1 -0.07241775  rna    R
## 24                                         R_BEND5 -0.07234925  rna    R
## 25                                       R_RETREG1 -0.07217920  rna    R
## 26       R_dehydroisoandrosterone.sulfate..DHEA.S. -0.07217218  met    R
## 27                                      R_SBF2.AS1 -0.07177514  rna    R
## 28                                           R_MCC -0.07134993  rna    R
## 29                                         R_FBLN5 -0.07125529  rna    R
## 30                                        R_LARGE1 -0.07103803  rna    R
## 31                                          R_RIC3 -0.07100378  rna    R
## 32                                         R_NPAS2 -0.07044973  rna    R
## 33                                  R_LOC100507053 -0.07037711  rna    R
## 34                                        R_ZNF256 -0.07023252  rna    R
## 35                                         R_STRBP -0.07017166  rna    R
## 36 R_androstenediol..3beta.17beta..monosulfate..1. -0.07001620  met    R
## 37                                          R_FCMR -0.06978687  rna    R
## 38                       R_epiandrosterone.sulfate -0.06956086  met    R
## 39                               R_ENSG00000218730 -0.06951024  rna    R
## 40                                         R_PARM1 -0.06936988  rna    R
## 41                                  R_LOC105369827 -0.06929755  rna    R
## 42                               R_ENSG00000276097 -0.06912298  rna    R
## 43                                         R_KCNH8 -0.06901420  rna    R
## 44                                      R_GOLGA6L9 -0.06889562  rna    R
## 45                                        R_PAIP2B -0.06878330  rna    R
## 46                                         R_BCL7A -0.06862227  rna    R
## 47                                       R_ALDH5A1 -0.06834278  rna    R
## 48                                          R_SDK2 -0.06815060  rna    R
## 49                                          R_NT5E -0.06784976  rna    R
## 50                               R_ENSG00000224610 -0.06778231  rna    R
## 51                                        R_SPAG16 -0.06772607  rna    R
## 52                               R_ENSG00000211611 -0.06766079  rna    R
## 53                               R_ENSG00000270659 -0.06757890  rna    R
## 54                                      R_C19orf18 -0.06728761  rna    R
## 55                                         R_ZNF90 -0.06722372  rna    R
## 56                                         R_TCTN1 -0.06709662  rna    R
## 57                                         R_EPHA4 -0.06694079  rna    R
## 58                                         R_USP44 -0.06680327  rna    R
## 59                                      R_UBE2Q2P2 -0.06653659  rna    R
## 60                                        R_COBLL1 -0.06631573  rna    R
## 61                                        R_RASSF6 -0.06619258  rna    R
## 62                                          R_CLNK -0.06585953  rna    R
## 63                                         R_ACSM3 -0.06569360  rna    R
## 64                                         R_ABCB4 -0.06562701  rna    R
## 65                               R_ENSG00000218713 -0.06557909  rna    R
## 66    R_X1.stearoyl.2.arachidonoyl.GPC..18.0.20.4. -0.06545521  met    R
## 67                                        R_ZNF711 -0.06534418  rna    R
## 68                                       R_CNTNAP2 -0.06521699  rna    R
## 69                                          R_TTC9 -0.06479820  rna    R
## 70                                           R_OCM -0.06475870  rna    R
## 71                                       R_B3GALT2 -0.06458462  rna    R
## 72                                       R_PPFIBP1 -0.06424149  rna    R
## 73                               R_ENSG00000214797 -0.06397080  rna    R
## 74                               R_ENSG00000211679 -0.06387636  rna    R
## 75                                         R_SYNPO -0.06384984  rna    R
## 76                                         R_PLPP5 -0.06344037  rna    R
## 77                                        R_AMIGO1 -0.06341650  rna    R
## 78                                       R_MIR3679 -0.06329979  rna    R
## 79                                         R_IL23R -0.06328683  rna    R
## 80                                        R_ZNF540 -0.06313709  rna    R
## 81         R_X23_DN..86.....CD8low..13...TSCM.like -0.06299159  cyt    R
## 82                               R_ENSG00000225569 -0.06268405  rna    R
## 83                                          R_CHD7 -0.06257571  rna    R
## 84                                        R_ZNF165 -0.06246526  rna    R
## 85                                         R_LAMP3 -0.06235625  rna    R
## 86                                        R_P2RY10 -0.06233964  rna    R
## 87                                         R_NRCAM -0.06188135  rna    R
## 88                                          R_CDNF -0.06182447  rna    R
## 89                                          R_DAB1 -0.06159775  rna    R
## 90                                         R_OVGP1 -0.06153708  rna    R
## 91                               R_ENSG00000254777 -0.06134844  rna    R
## 92                               R_ENSG00000235251 -0.06126092  rna    R
## 93                               R_ENSG00000211820 -0.06124757  rna    R
## 94                                        R_MARCH3 -0.06089070  rna    R
## 95                               R_ENSG00000234902 -0.06079997  rna    R
## 96                               R_ENSG00000273062 -0.06079094  rna    R
## 97                                         R_TTC28 -0.06020687  rna    R
## 98                               R_ENSG00000144642 -0.06015380  rna    R
## 99                                       R_SDR42E1 -0.06002544  rna    R

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                                        feature
## 1                                                                     R_SRGAP1
## 2                                                                      R_P2RY1
## 3                                                                     R_FLVCR2
## 4                                                                       R_C1QC
## 5                                                                       R_IL10
## 6                                                                    R_ADAMTS2
## 7                                                                     R_PCYT1B
## 8                                                                       R_TCN2
## 9                                                                       R_MAFB
## 10                                                                    R_TRIM58
## 11                                                                      R_MMP8
## 12                                                                      R_C1QB
## 13                                                                     R_MEIS1
## 14                                                                      R_C1QA
## 15                                                                     R_ACSL1
## 16                                                                      R_TNS1
## 17                                                                     R_LTBP1
## 18                    R_X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 19                                                                     R_SPARC
## 20                                                                    R_SH3TC2
## 21                                                                   R_ARHGAP6
## 22                                                                     R_ITGB3
## 23                                                                      R_SELP
## 24                                                                      R_MAOB
## 25                                                                      R_LCN2
## 26                                                                    R_ABLIM3
## 27                                                                     R_TTC7B
## 28                                                                    R_CTDSPL
## 29                                                                    R_LANCL3
## 30                                                                    R_MFSD2B
## 31                                                       R_X.CD38._27_CD4.Treg
## 32                            R_X.CD38._6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS.
## 33           R_X.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34..
## 34                                             R_X.CTLA4._24_CD4.TCM.Th17.like
## 35                                                                       R_CA1
## 36                                                                      R_SDC3
## 37                                                                  R_SH3PXD2B
## 38                                                                     R_PEAR1
## 39 R_X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers
## 40                                                                      R_FBP1
## 41                                                                   R_ANKRD22
##       overexp orig type
## 1  0.07097766  rna    R
## 2  0.06996857  rna    R
## 3  0.06937094  rna    R
## 4  0.06936852  rna    R
## 5  0.06919927  rna    R
## 6  0.06810588  rna    R
## 7  0.06496815  rna    R
## 8  0.06462919  rna    R
## 9  0.06449817  rna    R
## 10 0.06341289  rna    R
## 11 0.06302870  rna    R
## 12 0.06253024  rna    R
## 13 0.06234157  rna    R
## 14 0.06219826  rna    R
## 15 0.06202042  rna    R
## 16 0.06176943  rna    R
## 17 0.06034922  rna    R
## 18 0.05929661  cyt    R
## 19 0.05907008  rna    R
## 20 0.05842199  rna    R
## 21 0.05829862  rna    R
## 22 0.05761459  rna    R
## 23 0.05746313  rna    R
## 24 0.05738510  rna    R
## 25 0.05737555  rna    R
## 26 0.05705676  rna    R
## 27 0.05600048  rna    R
## 28 0.05578129  rna    R
## 29 0.05559423  rna    R
## 30 0.05473175  rna    R
## 31 0.05417959  cyt    R
## 32 0.05413699  cyt    R
## 33 0.05402362  cyt    R
## 34 0.05283975  cyt    R
## 35 0.05268783  rna    R
## 36 0.05260759  rna    R
## 37 0.05253019  rna    R
## 38 0.05202059  rna    R
## 39 0.05152623  cyt    R
## 40 0.05132304  rna    R
## 41 0.05077035  rna    R
Features expressed in the fourth principal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                          feature    overexp orig type
## 1                                     R_PDCD1LG2 -0.1402448  rna    R
## 2                                       R_GBP1P1 -0.1272238  rna    R
## 3                                         R_GBP5 -0.1163539  rna    R
## 4                                    R_LINC02528 -0.1088952  rna    R
## 5             R_methyl.4.hydroxybenzoate.sulfate -0.1042471  met    R
## 6 R_X.CD38._23_DN..86.....CD8low..13...TSCM.like -0.1024297  cyt    R
## 7                                       R_P2RY10 -0.1023204  rna    R

We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                                       feature
## 1                                                              R_beta.alanine
## 2                                                                   R_leucine
## 3  R_X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## 4                                                                   R_alanine
## 5                                                         R_choline.phosphate
## 6                                                                R_isoleucine
## 7                                                                R_methionine
## 8                                                                      D_TCF7
## 9                                                                    R_valine
## 10                                                                D_LINC01238
## 11                                                  R_X3.methyl.2.oxovalerate
## 12                                                                R_histidine
## 13                                                               R_asparagine
##      overexp orig type
## 1  0.1392457  met    R
## 2  0.1331654  met    R
## 3  0.1222863  cyt    R
## 4  0.1148477  met    R
## 5  0.1135772  met    R
## 6  0.1116683  met    R
## 7  0.1066787  met    R
## 8  0.1047017  rna    D
## 9  0.1024229  met    R
## 10 0.1019528  rna    D
## 11 0.1018347  met    R
## 12 0.1002932  met    R
## 13 0.1001522  met    R

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 23 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.82      0.6571429
##   0.25  L1    2       0.82      0.6571429
##   0.25  L1    3       0.82      0.6571429
##   0.25  L2    1       0.82      0.6571429
##   0.25  L2    2       0.82      0.6571429
##   0.25  L2    3       0.78      0.5802198
##   0.50  L1    1       0.82      0.6571429
##   0.50  L1    2       0.82      0.6571429
##   0.50  L1    3       0.82      0.6571429
##   0.50  L2    1       0.82      0.6571429
##   0.50  L2    2       0.78      0.5802198
##   0.50  L2    3       0.78      0.5802198
##   1.00  L1    1       0.82      0.6571429
##   1.00  L1    2       0.82      0.6571429
##   1.00  L1    3       0.82      0.6571429
##   1.00  L2    1       0.82      0.6571429
##   1.00  L2    2       0.78      0.5802198
##   1.00  L2    3       0.78      0.5802198
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

The results haven’t been improved by the addition of the couple’s features.

Correlations between features

We can now see how the variables that we extracted from recipients and donors are coorelated. Correlations between variables are represented by edges in the graph. We kept only correlations that were common between the St Louis and the Cryostem cohort.

Random forest modeling

To see if the features that we selected can help us to clssify the patients, we apply a 5 fold cross validation setting on the 23 patients for which we have CyTOF, metabolomics and transcriptomics information.

set.seed(1)
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- pca_2plot[-cv_id[[idx]], c("group", colnames(ft_integ)[-1])]
  test_set <- pca_2plot[cv_id[[idx]], c(colnames(ft_integ)[-1])]
  true_labels <- pca_2plot[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88

Separate data type analyses

For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients and couples as well without integrating the data:

CyTOF:

We perform PCA using only the 49 features that were selected as informative in the recipients and couples in both cohorts:

The first comoponent seems to be capturing most of the data variability:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC10 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly better separated from the tolerant recipients than in the integrative PCA.

We can also have a look at the features that are driving these two components:

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.75      0.5373626
##   0.25  L1    2       0.75      0.5373626
##   0.25  L1    3       0.75      0.5373626
##   0.25  L2    1       0.79      0.6032967
##   0.25  L2    2       0.84      0.7032967
##   0.25  L2    3       0.84      0.7032967
##   0.50  L1    1       0.71      0.4476190
##   0.50  L1    2       0.71      0.4476190
##   0.50  L1    3       0.71      0.4476190
##   0.50  L2    1       0.71      0.4476190
##   0.50  L2    2       0.75      0.5135531
##   0.50  L2    3       0.75      0.5135531
##   1.00  L1    1       0.71      0.4476190
##   1.00  L1    2       0.71      0.4476190
##   1.00  L1    3       0.71      0.4476190
##   1.00  L2    1       0.71      0.4476190
##   1.00  L2    2       0.75      0.5135531
##   1.00  L2    3       0.75      0.5135531
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L2 and weight = 2.

The accuracy is quite lower than in the integrative model and hasn’t improved with the addition of features from the donors.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(cyt4pca),]
cyt4rf <- cbind(info_rf$group, cyt4pca)
colnames(cyt4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- cyt4rf[-cv_id[[idx]], ]
  test_set <- cyt4rf[cv_id[[idx]], -1]
  true_labels <- cyt4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88
Metabolomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC5 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa    
##   0.25  L1    1       0.78      0.5424242
##   0.25  L1    2       0.78      0.5424242
##   0.25  L1    3       0.78      0.5424242
##   0.25  L2    1       0.78      0.5412587
##   0.25  L2    2       0.74      0.4655012
##   0.25  L2    3       0.65      0.2897436
##   0.50  L1    1       0.69      0.3655012
##   0.50  L1    2       0.69      0.3655012
##   0.50  L1    3       0.69      0.3655012
##   0.50  L2    1       0.69      0.3655012
##   0.50  L2    2       0.69      0.3655012
##   0.50  L2    3       0.65      0.2897436
##   1.00  L1    1       0.64      0.2655012
##   1.00  L1    2       0.64      0.2655012
##   1.00  L1    3       0.64      0.2655012
##   1.00  L2    1       0.73      0.4412587
##   1.00  L2    2       0.69      0.3655012
##   1.00  L2    3       0.69      0.3655012
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

Again, the accuracy of the model is slightly lower than the accuracy of the integrative model.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(met4pca),]
met4rf <- cbind(info_rf$group, met4pca)
colnames(met4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- met4rf[-cv_id[[idx]], ]
  test_set <- met4rf[cv_id[[idx]], -1]
  true_labels <- met4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.7866667
Transcriptomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC8 is the 2nd mostly correlated with tolerance.

The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA.

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:

## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 23 samples
## 10 predictors
##  2 classes: 'non_tolerant', 'tolerant' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 19, 19, 18, 18, 18 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy  Kappa       
##   0.25  L1    1       0.61       0.190476190
##   0.25  L1    2       0.61       0.190476190
##   0.25  L1    3       0.61       0.190476190
##   0.25  L2    1       0.57       0.102564103
##   0.25  L2    2       0.57       0.102564103
##   0.25  L2    3       0.61       0.190476190
##   0.50  L1    1       0.61       0.190476190
##   0.50  L1    2       0.61       0.190476190
##   0.50  L1    3       0.61       0.190476190
##   0.50  L2    1       0.57       0.102564103
##   0.50  L2    2       0.57       0.102564103
##   0.50  L2    3       0.61       0.190476190
##   1.00  L1    1       0.61       0.190476190
##   1.00  L1    2       0.61       0.190476190
##   1.00  L1    3       0.61       0.190476190
##   1.00  L2    1       0.53      -0.003496503
##   1.00  L2    2       0.57       0.102564103
##   1.00  L2    3       0.61       0.190476190
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.

It looks like, in our integrative approach, we managed to separate the recipients based on their tolerance better than when we look at the different data types separately.

To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.

set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(rna4pca),]
rna4rf <- cbind(info_rf$group, rna4pca)
colnames(rna4rf)[1] <- "group"

cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
  train_set <- rna4rf[-cv_id[[idx]], ]
  test_set <- rna4rf[cv_id[[idx]], -1]
  true_labels <- rna4rf[cv_id[[idx]], "group"]
  rf <- ranger::ranger(group~., train_set, num.trees = 5000,
                           importance = "impurity")
  pred <- predict(rf, test_set)
  acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})

accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.8

Primary tolerant versus secondary tolerant recipients + donors

Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We now focus on the features that seemed to play a role when trying to disinguish primary tolerant from secondary tolerant recipients.

I extract informations on the recipients group, age, gender, … to be able to use this information later on.

## [1] "0"
## [1] "1"
##                  GROUP GENDER COUPLENUMBER        DOG TIMEPPOINT gender_comp
## D2497     non_tolerant      F            1       <NA>       D-15          FF
## R2498     non_tolerant      F            1 2014-08-06         Y2          FF
## D284      non_tolerant      F            2       <NA>       D-15          FM
## R385      non_tolerant      M            2 2013-05-02         Y2          FM
## D1502 primary_tolerant      F            3       <NA>       D-15          FM
## R1503 primary_tolerant      M            3 2014-01-02         Y2          FM
##       age_recip cmv_comp
## D2497     21794       01
## R2498     21794       01
## D284      21687      NA1
## R385      21687      NA1
## D1502      8264       11
## R1503      8264       11

We then need to gather the features that were extracted from the three data sources:

  • CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

  • Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.

I merge all features together:

## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 10 patients in the St Louis cohort."

Principal component analysis

We can first generate a PCA on the data:

The first principal component seems to hold much more information that the others

We can investigate how much of the principal components are driven by the different data types.

We can see the data origin (CyTOF, metabolomics or transcriptomics) of the variables in each PC.

We can also see how the features coming from recipients and donors are distributed in the PCs:

## Warning: Removed 10 rows containing missing values (position_stack).

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC8 and PC10) also look correlated with tolerance.

We can then see how the PCs are correlated with the different informations that we have on the patients.

The first PC seems to be most correlated with group, SAL, cGVHD and ABO.incompatibility:

The second PC seems to be slightly correlated with the hematologic disorder, source of graft and tolerance group:

The third PC seems to be slightly correlated with cmv compatibility between donors and recipients:

The fourth PC seems to be most correlated with gender compatibility between donors and recipients:

Since the 1st and 2nd PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the primary tolerant recipients (on the left) from the secondary tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD.

Features expressed in the first prinipal component

We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in primary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                           feature       overexp orig type
## 1                                        R_FAM83D -0.2056692745  rna    R
## 2                                        R_CDC25A -0.2033735681  rna    R
## 3                                          R_E2F8 -0.2028409289  rna    R
## 4                                       R_SNORD17 -0.2022515018  rna    R
## 5                               R_ENSG00000280981 -0.2013883103  rna    R
## 6                                          R_ASB2 -0.2013412175  rna    R
## 7                                         R_MZT2B -0.2004410068  rna    R
## 8                                         R_PCLAF -0.1999850192  rna    R
## 9                               R_ENSG00000233597 -0.1996197448  rna    R
## 10                                        R_TOP2A -0.1982321674  rna    R
## 11                                 D_LOC100506544 -0.1978123907  rna    D
## 12                                        R_MKI67 -0.1903894612  rna    R
## 13                                        R_UBE2C -0.1890480591  rna    R
## 14                              R_ENSG00000210107 -0.1869489618  rna    R
## 15                              D_ENSG00000255438 -0.1862347291  rna    D
## 16                                        R_KLRC2 -0.1786339752  rna    R
## 17                                         D_MEG3 -0.1763948665  rna    D
## 18                                         R_MAPT -0.1744884788  rna    R
## 19                              D_ENSG00000273599 -0.1685596174  rna    D
## 20                                      R_CCDC157 -0.1578968103  rna    R
## 21                                        D_TMCC2 -0.1568513166  rna    D
## 22                              D_ENSG00000260425 -0.1508572476  rna    D
## 23                                        R_AGMAT -0.1483518213  rna    R
## 24                                        R_ASAP3 -0.1414683602  rna    R
## 25                                       D_SYCP2L -0.1407207372  rna    D
## 26                                 R_LOC101929574 -0.1270882884  rna    R
## 27                                      D_AGAP12P -0.1232940362  rna    D
## 28                              D_ENSG00000224830 -0.1193678222  rna    D
## 29                                        D_AQP10 -0.1122484325  rna    D
## 30                                    D_LINC01225 -0.1071813359  rna    D
## 31                                       D_ITGA2B -0.1065314054  rna    D
## 32                              D_ENSG00000253190 -0.0919722801  rna    D
## 33 D_palmitoyl.dihydrosphingomyelin..d18.0.16.0.. -0.0889315360  met    D
## 34                                       R_TMEM53 -0.0884433083  rna    R
## 35                                    D_LOC283194 -0.0842717209  rna    D
## 36                          R_X3.aminoisobutyrate -0.0797879842  met    R
## 37                                   R_GPRC5D.AS1 -0.0330370400  rna    R
## 38                         D_X.41BB._21_B.naives. -0.0003633828  cyt    D

We then focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in secondary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                          feature    overexp
## 1                           R_X.CD38._16_T.population.Naive.TSCM 0.12862178
## 2                                                       R_PRUNE2 0.07642618
## 3                                              R_ENSG00000256658 0.06654757
## 4                                              D_ENSG00000223779 0.05141271
## 5  D_X.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.. 0.04876548
## 6                                           R_X.CD38._29_DP.Treg 0.04364770
## 7                                     R_X2.palmitoyl.GPC..16.0.. 0.04208212
## 8                                                        R_HACD1 0.03878277
## 9                D_X.CD38._12_DN.T.cells..70...CD8..30...TEM.TCM 0.01299884
## 10                                             D_ENSG00000254851 0.01258701
##    orig type
## 1   cyt    R
## 2   rna    R
## 3   rna    R
## 4   rna    D
## 5   cyt    D
## 6   cyt    R
## 7   met    R
## 8   rna    R
## 9   cyt    D
## 10  rna    D
Features expressed in the second principal component

We first focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed at the bottom of the PC2). Therefore, the lower the feature value, the more it is expressed in primary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :

##                                           feature      overexp orig type
## 1                                     D_LOC283194 -0.278137942  rna    D
## 2  D_palmitoyl.dihydrosphingomyelin..d18.0.16.0.. -0.263279984  met    D
## 3                           R_X3.aminoisobutyrate -0.250888981  met    R
## 4                               D_ENSG00000224830 -0.181354872  rna    D
## 5                                         R_KLRC2 -0.180591196  rna    R
## 6                               D_ENSG00000253190 -0.151148478  rna    D
## 7                                     D_LINC01225 -0.150506710  rna    D
## 8                                       D_AGAP12P -0.085214455  rna    D
## 9                                          D_MEG3 -0.082746663  rna    D
## 10                              D_ENSG00000255438 -0.077030747  rna    D
## 11                                        R_AGMAT -0.076351538  rna    R
## 12                                       R_FAM83D -0.059376193  rna    R
## 13                              R_ENSG00000256658 -0.053429769  rna    R
## 14                         D_X.41BB._21_B.naives. -0.053213793  cyt    D
## 15                                         R_ASB2 -0.032953110  rna    R
## 16                                        D_TMCC2 -0.027053128  rna    D
## 17                              D_ENSG00000273599 -0.018696095  rna    D
## 18                                         R_MAPT -0.009674983  rna    R

We then focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed on the top of the PC2). Therefore, the higher the feature value, the more it is expressed in secondary tolerant recipients.

We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :

##                                                          feature    overexp
## 1  D_X.CD38._19_DN..86...CD8low.13...TSCM.like.FoxP3...30.à.90.. 0.32050789
## 2                                           R_X.CD38._29_DP.Treg 0.26059146
## 3                                                       R_PRUNE2 0.25849965
## 4                D_X.CD38._12_DN.T.cells..70...CD8..30...TEM.TCM 0.25377195
## 5                                                        D_AQP10 0.18884454
## 6                           R_X.CD38._16_T.population.Naive.TSCM 0.18677021
## 7                                                        R_UBE2C 0.17710541
## 8                                     R_X2.palmitoyl.GPC..16.0.. 0.17058700
## 9                                                        R_HACD1 0.15616547
## 10                                                       R_MKI67 0.13184111
## 11                                                       R_PCLAF 0.13044641
## 12                                                      R_TMEM53 0.12851901
## 13                                                       R_MZT2B 0.12606710
## 14                                                      D_ITGA2B 0.12531169
## 15                                                      D_SYCP2L 0.12111976
## 16                                                     R_SNORD17 0.11861259
## 17                                                     R_CCDC157 0.11438887
## 18                                             D_ENSG00000254851 0.10846741
## 19                                             R_ENSG00000210107 0.09968309
## 20                                                        R_E2F8 0.09157717
## 21                                                  R_GPRC5D.AS1 0.08662232
## 22                                                       R_ASAP3 0.07198473
## 23                                             D_ENSG00000223779 0.06915841
## 24                                                       R_TOP2A 0.06142598
## 25                                                      R_CDC25A 0.05274195
## 26                                             R_ENSG00000280981 0.05257981
## 27                                                R_LOC101929574 0.03488910
## 28                                             R_ENSG00000233597 0.03312577
## 29                                             D_ENSG00000260425 0.02824863
## 30                                                D_LOC100506544 0.02129929
##    orig type
## 1   cyt    D
## 2   cyt    R
## 3   rna    R
## 4   cyt    D
## 5   rna    D
## 6   cyt    R
## 7   rna    R
## 8   met    R
## 9   rna    R
## 10  rna    R
## 11  rna    R
## 12  rna    R
## 13  rna    R
## 14  rna    D
## 15  rna    D
## 16  rna    R
## 17  rna    R
## 18  rna    D
## 19  rna    R
## 20  rna    R
## 21  rna    R
## 22  rna    R
## 23  rna    D
## 24  rna    R
## 25  rna    R
## 26  rna    R
## 27  rna    R
## 28  rna    R
## 29  rna    D
## 30  rna    D

Correlations between features

We can now see how the variables that we extracted from recipients and donors are coorelated. Correlations between variables are represented by edges in the graph. We kept only correlations that were common between the St Louis and the Cryostem cohort.

Separate data type analyses

For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients and couples as well without integrating the data:

CyTOF:

We perform PCA using only the 49 features that were selected as informative in the recipients and couples in both cohorts:

The first comoponent seems to be capturing most of the data variability:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 is the 2nd mostly correlated with tolerance.

The primary and secondary patients are well separatd in this PCA.

We can also have a look at the features that are driving these two components:

Metabolomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC2 is the 2nd mostly correlated with tolerance.

The primary and secondary tolerant patients are less well separated in this PCA.

Transcriptomics:

Variability captured by each component:

We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC5 is the 2nd mostly correlated with tolerance.

The primary tolerant patients are perfectly separated from the secondary tolerant patients in this PCA.